Background

The goal of this project is to implement a machine learning algorithm for predicting whether a subject is performing a weight lifting exercise correctly.

The Weight Lifting Exercises (WLE) dataset (source: http://groupware.les.inf.puc-rio.br/har) is used in this project. The dataset was collected by recording signals from wearable sensors while the subjects performed weight lifting activities. Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes. More information about the WLE dataset can be found in [1] or at http://groupware.les.inf.puc-rio.br/har#ixzz3xhadU0A4. The training data can be downloaded here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The WLE training dataset consists of 19622 observations of 160 variables. The “classe” variable in the training set (with values A, B, C, D or E) is the outcome the algorithm should predict. The training dataset is partitioned into two parts: a 75% portion is used for model training and cross validation, and the remaining 25% serves as test data for estimating the out-of-sample error. The final chosen algorithm is then applied to predict the outcome of the 20 test cases. The test data can be downloaded here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
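The loading and partitioning step can be sketched as follows (the file name assumes the training CSV has been downloaded to the working directory, and the seed value is illustrative):

```r
# Load the WLE training data and split it 75/25 for model building
# and out-of-sample testing.
library(caret)

pml <- read.csv("pml-training.csv")

set.seed(12345)   # arbitrary seed, for reproducibility
inTrain  <- createDataPartition(pml$classe, p = 0.75, list = FALSE)
training <- pml[inTrain, ]
testing  <- pml[-inTrain, ]
```

createDataPartition samples within each class of `classe`, so the five classes keep roughly the same proportions in both portions.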

The caret package is used in this project for building and evaluating the machine learning algorithms.

Feature selection and model training

Data cleaning

Some exploratory analysis of the training dataset shows that columns 1 to 7 contain variables not directly obtained from the wearable sensors. Various plots showing the relationship of these variables to the “classe” variable were investigated.

Three of the plots are shown here. The first plot shows that the variable “X” in column 1 is simply a row index for the dataset, which happens to be sorted by the “classe” variable. Columns 2 to 7 contain timestamps and measurement-window related data. When these variables are plotted against any sensor-measured variable and grouped by “classe”, they show no meaningful correlation with “classe”; they appear merely to indicate the different times at which the subjects performed the exercises.

Based on these findings, columns 1 to 7 are manually dropped from the features set.
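Dropping these columns is a one-line operation (assuming the 75/25 split has produced data frames named `training` and `testing`):

```r
# Remove the row index, user name, timestamps and window variables
# (columns 1 to 7), which carry no sensor information.
training <- training[, -(1:7)]
testing  <- testing[, -(1:7)]
```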

Dealing with missing values

Many of the remaining variables are of class ‘factor’ although their values are numeric. In addition, some appear to be sparse, with many missing values and #DIV/0! entries. There are two considerations in dealing with these sparse columns: first, how sparse they are, and second, whether some are still significant despite the sparsity. These variables are first converted to numeric (using as.numeric(as.character())), so that each missing entry is given the value NA.

To address the first question, the percentage of NAs in each column is calculated. One hundred columns were found to contain more than 98% NAs; with such a high percentage of missing values, these variables are too sparse to be useful. These 100 columns are removed, leaving the dataset with 53 variables, including the classe variable. A further check for near-zero variance found no nzv variables. This dataset is then used for model building.
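A sketch of this cleaning step, assuming the `training` and `testing` data frames from the earlier split (the 98% threshold is the one described above):

```r
library(caret)

# Convert every predictor to numeric; entries such as "#DIV/0!" or
# blanks become NA in the process. The same conversion is applied to
# the held-out test portion so the column types stay compatible.
predCols <- names(training) != "classe"
training[predCols] <- lapply(training[predCols],
                             function(x) as.numeric(as.character(x)))
testing[predCols]  <- lapply(testing[predCols],
                             function(x) as.numeric(as.character(x)))

# Keep only columns with fewer than 98% NAs, in both portions.
keep     <- colMeans(is.na(training)) < 0.98
training <- training[, keep]
testing  <- testing[, keep]

# Check for near-zero variance predictors (an empty result means none).
nearZeroVar(training)
```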

Model training and cross validation

The cleaned training dataset above is used for building the machine learning algorithm. As predicting ‘classe’ is a classification problem, classification trees and random forests are considered in this project. Four cases were considered, all employing k-fold cross validation with k = 8.

  • Case 1 : Classification tree
  • Case 2 : Bagged classification trees (25 bootstrap replications)
  • Case 3 : Bagged classification trees (25 bootstrap replications), preprocessed with PCA
  • Case 4 : Random forest
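The four training calls can be sketched as below; the object names match the printouts that follow, and the 25 bootstrap replications are caret's default for method "treebag":

```r
# Train all four candidate models with the same 8-fold CV control.
library(caret)
ctrl <- trainControl(method = "cv", number = 8)

Model_tree     <- train(classe ~ ., data = training, method = "rpart",
                        trControl = ctrl)
Model_treebag  <- train(classe ~ ., data = training, method = "treebag",
                        trControl = ctrl)
Model_treebag1 <- train(classe ~ ., data = training, method = "treebag",
                        preProcess = "pca", trControl = ctrl)
Model_rf       <- train(classe ~ ., data = training, method = "rf",
                        trControl = ctrl)
```

Printing each model object produces the summaries shown below.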

The trained model performance of each case with 8-fold CV is as follows.

Model_tree
## CART 
## 
## 14718 samples
##    51 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 12878, 12878, 12878, 12878, 12879, 12878, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.03588721  0.5010191  0.3487380  0.01662134   0.02164618
##   0.05879933  0.4442834  0.2572766  0.06313123   0.10670032
##   0.11535175  0.3352348  0.0778774  0.04236619   0.06478229
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03588721.
Model_treebag
## Bagged CART 
## 
## 14718 samples
##    51 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 12877, 12878, 12879, 12878, 12877, 12878, ... 
## Resampling results
## 
##   Accuracy   Kappa     Accuracy SD  Kappa SD   
##   0.9848482  0.980833  0.00355376   0.004496107
## 
## 
Model_treebag1   # with pca
## Bagged CART 
## 
## 14718 samples
##    51 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (52), centered
##  (52), scaled (52) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 12877, 12879, 12879, 12878, 12879, 12878, ... 
## Resampling results
## 
##   Accuracy   Kappa     Accuracy SD  Kappa SD   
##   0.9548844  0.942929  0.006905228  0.008733146
## 
## 
Model_rf
## Random Forest 
## 
## 14718 samples
##    51 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 12878, 12879, 12878, 12878, 12878, 12877, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9914394  0.9891703  0.002354663  0.002979013
##   27    0.9923906  0.9903742  0.002613584  0.003305818
##   52    0.9863437  0.9827241  0.003526208  0.004460758
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

The plain classification tree achieves a prediction accuracy of only 50.1019%. Preprocessing with PCA (which results in 25 principal components) did not improve the prediction accuracy of the bagged tree method but instead degraded it. The bagged classification tree and random forest models give accuracies of 98.4848% and 99.2391%, respectively. Both the treebag (case 2) and random forest (case 4) models are therefore tested on the held-out test data (the 25% portion of the training data) to estimate the out-of-sample error.

Out-of-sample test

Following are the confusion matrices of the bagged classification tree and random forest algorithms when predicting the outcome on the test portion of the training data.
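These matrices can be produced with caret's confusionMatrix, assuming the fitted models and the `testing` data frame from the earlier steps:

```r
# Compare each model's predictions on the held-out 25% split
# against the true classe values.
library(caret)
confMatTbag <- confusionMatrix(predict(Model_treebag, testing),
                               testing$classe)
confMatRF   <- confusionMatrix(predict(Model_rf, testing),
                               testing$classe)
```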

confMatTbag
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1389    4    2    0    0
##          B    7  936    5    1    0
##          C    0    4  846    5    0
##          D    0    0   13  790    1
##          E    0    1    3    8  889
## 
## Overall Statistics
##                                           
##                Accuracy : 0.989           
##                  95% CI : (0.9857, 0.9917)
##     No Information Rate : 0.2847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9861          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9905   0.9735   0.9826   0.9989
## Specificity            0.9983   0.9967   0.9978   0.9966   0.9970
## Pos Pred Value         0.9957   0.9863   0.9895   0.9826   0.9867
## Neg Pred Value         0.9980   0.9977   0.9943   0.9966   0.9998
## Prevalence             0.2847   0.1927   0.1772   0.1639   0.1815
## Detection Rate         0.2832   0.1909   0.1725   0.1611   0.1813
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9966   0.9936   0.9857   0.9896   0.9979
confMatRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    3    0    0    0
##          B    4  942    3    0    0
##          C    0    3  848    4    0
##          D    0    0    8  795    1
##          E    0    0    1    4  896
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9937         
##                  95% CI : (0.991, 0.9957)
##     No Information Rate : 0.2847         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.992          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9971   0.9937   0.9860   0.9900   0.9989
## Specificity            0.9991   0.9982   0.9983   0.9978   0.9988
## Pos Pred Value         0.9978   0.9926   0.9918   0.9888   0.9945
## Neg Pred Value         0.9989   0.9985   0.9970   0.9980   0.9998
## Prevalence             0.2847   0.1933   0.1754   0.1637   0.1829
## Detection Rate         0.2838   0.1921   0.1729   0.1621   0.1827
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9981   0.9960   0.9922   0.9939   0.9988

As expected from the confusion matrices, the random forest algorithm gives a higher accuracy of 99.37%, compared to 98.9% for the bagged classification tree method. The estimated out-of-sample error, calculated as (1 − Accuracy) × 100%, is 1.101% for the bagged tree and 0.632% for the random forest. Based on this, the random forest algorithm is chosen for predicting the 20 test cases of the quiz.

Final results

The random forest model is used to predict the 20 test cases of the Coursera Practical Machine Learning quiz. The following predictions for the 20 cases were submitted, and the algorithm predicted all 20 cases correctly.

B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B
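A sketch of the final prediction step (the file name is the downloaded test CSV; the quiz file contains the same predictor columns, and they are assumed to be read with types compatible with the fitted model):

```r
# Predict the 20 quiz cases with the chosen random forest model.
quiz <- read.csv("pml-testing.csv")
predict(Model_rf, newdata = quiz)
```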

Reference

[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.